USHIFT-6401, USHIFT-6788: Add fail-fast RBAC bootstrap hook deadlock detection #6471
copejon wants to merge 3 commits into openshift:main from
Conversation
@copejon: This pull request references USHIFT-6401 which is a valid jira issue. Details. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough

Adds concurrent RBAC bootstrap hook deadlock detection to the kube-apiserver Run: a detector probes /readyz/poststarthook/rbac/bootstrap-roles and etcd health, signals confirmed deadlock, attempts recovery by stopping

Changes
Sequence Diagram

sequenceDiagram
participant Run as Run Method
participant Poll as Readiness Polling
participant Detector as RBAC Detector
participant Hook as /readyz/poststarthook/rbac/bootstrap-roles
participant Etcd as Etcd (v3)
participant Systemctl as systemctl
Run->>Poll: start readiness polling
Run->>Detector: start RBAC deadlock detector goroutine
Detector->>Detector: wait initial delay
loop periodic checks
Detector->>Hook: GET readiness endpoint
Detector->>Etcd: short-timeout Status probe (TLS)
Etcd-->>Detector: health response
alt Etcd healthy & Hook remains not-ready past threshold
Detector->>Systemctl: run "systemctl stop microshift-etcd.scope"
Systemctl-->>Detector: result
Detector->>Run: signal deadlock detected (close channel)
Run->>Run: log error, attempt recovery, return deadlock error
else Hook ready or threshold not reached
Detector->>Detector: continue polling
end
end
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 9 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (9 passed)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/apparentlymart/go-cidr@v1.1.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/go-systemd@v0.0.0-20190321100706-95778dfbb74e: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/google/go-cmp@v0.7.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/miekg/dns@v1.1.63: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/openshift/api@v0.0.0-20260309155933-45fd88d185dd: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/openshift/build-machinery-go@v0.0.0-20251023084048-5d77c1a5e5af: is explicitly required in go.mod, but not marked as explicit ... [truncated 29518 characters] ... 
belet: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/metrics: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/mount-utils: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/pod-security-admission: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-apiserver: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-cli-plugin: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-controller: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: copejon. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/controllers/kube-apiserver.go (1)
542-557: Function name is misleading.
`restartMicroshiftEtcdScope` only stops the scope; the restart occurs when MicroShift itself restarts after `Run()` returns an error. Consider renaming to `stopMicroshiftEtcdScope` for clarity.

♻️ Suggested rename

```diff
-// restartMicroshiftEtcdScope restarts the microshift-etcd.scope to recover from deadlock.
-// This forces a clean restart of etcd which can help break the circular dependency.
-func restartMicroshiftEtcdScope() error {
+// stopMicroshiftEtcdScope stops the microshift-etcd.scope to recover from deadlock.
+// MicroShift will restart both etcd and kube-apiserver after Run() returns an error.
+func stopMicroshiftEtcdScope() error {
 	klog.Info("Stopping microshift-etcd.scope for recovery")
```

Also update the call site at line 438.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controllers/kube-apiserver.go` around lines 542 - 557, The function restartMicroshiftEtcdScope is misnamed because it only stops the scope; rename the function to stopMicroshiftEtcdScope (and update its comment and klog messages accordingly) and update every call site that invokes restartMicroshiftEtcdScope to call stopMicroshiftEtcdScope instead; ensure exported/internal visibility remains unchanged and run tests/build to verify no references remain to the old name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 488-496: The current RBAC bootstrap loop resets checkCount to 0
when etcd is unhealthy which allows flapping to prevent ever reaching maxChecks;
modify the logic in the kube-apiserver.go RBAC detection loop (around variables
checkCount, maxChecks, and etcdHealthy) to track a wall-clock deadline (e.g.,
startTime or deadline) when the check begins and do not reset that deadline on
etcd state changes—always compare time.Since(startTime) (or
time.Now().After(deadline)) against a configured maxWaitDuration and fail/exit
the loop if exceeded; keep the existing checkCount logic for deadlock detection
but add this independent timeout check so flapping cannot extend detection
indefinitely.
---
Nitpick comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 542-557: The function restartMicroshiftEtcdScope is misnamed
because it only stops the scope; rename the function to stopMicroshiftEtcdScope
(and update its comment and klog messages accordingly) and update every call
site that invokes restartMicroshiftEtcdScope to call stopMicroshiftEtcdScope
instead; ensure exported/internal visibility remains unchanged and run
tests/build to verify no references remain to the old name.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 8e1dd33d-45f9-439d-91af-e5c44c9150fc
📒 Files selected for processing (1)
pkg/controllers/kube-apiserver.go
@coderabbitai help
Chat

There are 3 ways to chat with CodeRabbit:
CodeRabbit commands
Other keywords and placeholders
Status, support, documentation and community
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 469-472: The loop that decides to signal deadlock is incrementing
checkCount and falling through to close(deadlockDetected) on
rbacHookMaxWaitDuration expiry even when isEtcdHealthy() or the RBAC probe
errored; modify the loop (the block around checkCount, isEtcdHealthy,
rbacHookMaxWaitDuration, maxChecks, and close(deadlockDetected)) so that
checkCount is only incremented when the "healthy-etcd stall" predicate is
actually observed (i.e., when both the RBAC probe is unfinished AND
isEtcdHealthy() returns true), ignore/skip increments when probes or health
checks error, and on hitting rbacHookMaxWaitDuration return early unless
checkCount has already reached maxChecks; ensure close(deadlockDetected) is only
called after the predicate has been confirmed maxChecks times.
- Around line 483-486: The RBAC readiness probe call using
restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(ctx)...
uses the long-lived controller ctx and can hang; wrap that request in a short
per-probe context with timeout (e.g., context.WithTimeout(ctx, 2*time.Second)),
use the derived probeCtx in Do(probeCtx), and defer cancel() so the goroutine
unblocks and the 2s interval / 30s max-wait logic can re-evaluate; update the
call that sets status and err (StatusCode(&status).Error()) to use the probe
context.
- Around line 558-563: The call to stopCmd.CombinedOutput() in
restartMicroshiftEtcdScope is unbounded and can hang; wrap the systemctl
invocation with a context deadline (e.g., create a context.WithTimeout and use
exec.CommandContext) and use that context when constructing stopCmd, then read
CombinedOutput (or CombinedOutput on the context-backed command) and return a
timeout-aware error if the context expires; reference
restartMicroshiftEtcdScope, stopCmd and CombinedOutput and ensure you cancel the
context and propagate/annotate context timeout errors in the returned fmt.Errorf
message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 6d409d90-afaf-448d-81e1-02888bc827c6
📒 Files selected for processing (1)
pkg/controllers/kube-apiserver.go
@copejon: This pull request references USHIFT-6401 which is a valid jira issue. This pull request references USHIFT-6788 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set. Details. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@copejon: This pull request references USHIFT-6401 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead. This pull request references USHIFT-6788 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set. Details. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
♻️ Duplicate comments (1)
pkg/controllers/kube-apiserver.go (1)
468-515: ⚠️ Potential issue | 🟠 Major

Wall-clock deadline break still triggers false deadlock signal.

When `rbacHookMaxWaitDuration` expires at line 472, the loop breaks and falls through to `close(deadlockDetected)` at line 515, even if etcd was never confirmed healthy. The "etcd healthy + hook stuck" predicate may never have been observed.

Additionally, `checkCount++` at line 481 executes before conditions are verified. If `isEtcdHealthy` errors, the count advances without observing the deadlock predicate.

Proposed fix

```diff
+	deadlockPredicateObserved := false
 	for checkCount < maxChecks {
 		// Check absolute deadline first - this cannot be reset by etcd state changes
 		if time.Since(startTime) >= rbacHookMaxWaitDuration {
 			klog.Errorf("RBAC bootstrap hook exceeded maximum wait duration of %v", rbacHookMaxWaitDuration)
-			break
+			if !deadlockPredicateObserved {
+				klog.Warning("Wall-clock deadline reached but deadlock predicate never confirmed - not signaling deadlock")
+				return
+			}
+			break // predicate was observed, signal deadlock
 		}
 		select {
 		case <-ctx.Done():
 			return
 		case <-time.After(rbacHookCheckInterval * time.Second):
 		}
-		checkCount++
 		// Check RBAC hook status
 		var status int
 		err := restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(ctx).StatusCode(&status).Error()
 		// If hook is ready, no deadlock
 		if err == nil && status == 200 {
 			klog.V(4).Info("RBAC bootstrap hook completed successfully")
 			return
 		}
 		// Hook not ready - check if etcd is healthy
 		etcdHealthy, etcdErr := isEtcdHealthy(ctx)
 		if etcdErr != nil {
 			klog.V(4).Infof("Could not check etcd health: %v", etcdErr)
 			continue
 		}
 		if etcdHealthy {
+			checkCount++
+			deadlockPredicateObserved = true
 			klog.Warningf("RBAC bootstrap hook not ready (check %d/%d, elapsed %v), but etcd is healthy - potential deadlock", checkCount, maxChecks, time.Since(startTime).Round(time.Second))
 		} else {
 			// etcd not healthy - not a deadlock, just waiting for etcd
 			klog.V(4).Infof("RBAC hook waiting, etcd not yet healthy (check %d/%d)", checkCount, maxChecks)
-			// Reset counter since this isn't a deadlock condition
-			// Note: wall-clock deadline (startTime) is NOT reset - flapping cannot extend indefinitely
 			checkCount = 0
 		}
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controllers/kube-apiserver.go` around lines 468 - 515, The loop currently can fall through to close(deadlockDetected) on wall-clock timeout even if etcd was never observed healthy and also increments checkCount before confirming etcd health; fix by tracking whether etcd has ever been observed healthy (e.g., add sawEtcdHealthy bool set to true when isEtcdHealthy(ctx) returns (true, nil)), only increment checkCount when etcdHealthy==true and etcdErr==nil (move the checkCount++ after the etcd health success path), and after the loop only signal deadlock (close(deadlockDetected)) if sawEtcdHealthy is true and the RBAC hook still isn’t ready; leave the wall-clock timeout to break the loop but not unconditionally declare a deadlock. Ensure references: restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(...).StatusCode(&status).Error(), isEtcdHealthy(ctx), checkCount, sawEtcdHealthy, and deadlockDetected.
🧹 Nitpick comments (1)
pkg/controllers/kube-apiserver.go (1)
566-568: Consider context-aware wait.

The 1s `time.Sleep` isn't interruptible. Low impact, but a `select` with `ctx.Done()` would be cleaner for graceful shutdown.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controllers/kube-apiserver.go` around lines 566 - 568, Replace the uninterruptible time.Sleep(1 * time.Second) with a context-aware wait so shutdown can be cancelled: where the sleep occurs (the time.Sleep call in kube-apiserver.go cleanup/wait block), use a select that waits on time.After(1 * time.Second) and also listens on ctx.Done(), returning early if ctx is cancelled; ensure you reference the existing ctx variable and keep the same timeout duration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 468-515: The loop currently can fall through to
close(deadlockDetected) on wall-clock timeout even if etcd was never observed
healthy and also increments checkCount before confirming etcd health; fix by
tracking whether etcd has ever been observed healthy (e.g., add sawEtcdHealthy
bool set to true when isEtcdHealthy(ctx) returns (true, nil)), only increment
checkCount when etcdHealthy==true and etcdErr==nil (move the checkCount++ after
the etcd health success path), and after the loop only signal deadlock
(close(deadlockDetected)) if sawEtcdHealthy is true and the RBAC hook still
isn’t ready; leave the wall-clock timeout to break the loop but not
unconditionally declare a deadlock. Ensure references:
restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(...).StatusCode(&status).Error(),
isEtcdHealthy(ctx), checkCount, sawEtcdHealthy, and deadlockDetected.
---
Nitpick comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 566-568: Replace the uninterruptible time.Sleep(1 * time.Second)
with a context-aware wait so shutdown can be cancelled: where the sleep occurs
(the time.Sleep call in kube-apiserver.go cleanup/wait block), use a select that
waits on time.After(1 * time.Second) and also listens on ctx.Done(), returning
early if ctx is cancelled; ensure you reference the existing ctx variable and
keep the same timeout duration.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 8aa2f8a5-61b0-40e5-a4e9-2c35b99a9835
📒 Files selected for processing (1)
pkg/controllers/kube-apiserver.go
/retest
@copejon: This pull request references USHIFT-6401 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead. This pull request references USHIFT-6788 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set. Details. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
…tion

On MicroShift restart, the RBAC bootstrap hook can deadlock when etcd contains existing data. The hook uses context.TODO() for API calls, which has no timeout. When the loopback client hangs, this creates a circular dependency where the hook waits for the API server while the API server waits for the hook to complete.

This change adds a parallel deadlock detector that:

- Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
- Checks if etcd is healthy while the hook is stuck
- Detects deadlock in ~15 seconds instead of waiting 60 seconds
- Restarts microshift-etcd.scope to recover from the deadlock

This breaks the crash loop by detecting the condition early and taking recovery action at the MicroShift level, without requiring changes to vendored upstream Kubernetes code.

Related upstream issues: kubernetes/kubernetes#86715, #97119

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add rbacHookMaxWaitDuration (30s) as an absolute deadline that cannot be reset by etcd health state changes. This prevents a flapping etcd from extending the deadlock detection indefinitely.

The existing checkCount logic is preserved for detecting deadlock when etcd is consistently healthy, but the wall-clock deadline provides a hard upper bound regardless of etcd state transitions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…itives

- Only increment checkCount when deadlock predicate confirmed (RBAC not ready AND etcd healthy)
- Skip counting when RBAC probe or etcd health check errors
- On wall-clock timeout, only trigger recovery if checkCount >= maxChecks
- Add 1s timeout to RBAC probe to prevent hanging on unresponsive API
- Add 5s timeout to systemctl stop to prevent recovery path from stalling
- Extract rbacHookPollDelayStart constant for clarity

Prevents false positive deadlock detection when etcd flaps or probes error, ensuring close(deadlockDetected) only fires after confirming the deadlock condition the required number of times.

Related: kubernetes/kubernetes#86715, #97119

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Force-pushed 7daee5d to 184e1cb (Compare)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 59-69: rbac timing constants rbacHookDeadlockTimeout and
rbacHookCheckInterval are plain integers but used as time.Duration; change them
to durations by multiplying by time.Second (rbacHookDeadlockTimeout = 15 *
time.Second and rbacHookCheckInterval = 2 * time.Second), then update the
log/format that prints the timeout to use %s (duration) instead of “%d seconds”,
and remove any places that multiply rbacHookCheckInterval by time.Second (e.g.
the spot that computes maxChecks / or multiplies the interval) so all arithmetic
uses time.Duration consistently.
- Around line 496-515: The current RBAC probe logic treats any non-nil err from
restClient.Get() as a transport error and skips deadlock counting; change the
condition so only requests with no HTTP status code (transport/timeout errors)
are skipped — i.e., update the check around restClient.Get()/probeCtx to
skip/continue only when err != nil AND status == 0 (use the existing status
integer and err from the Do(...).StatusCode(&status).Error() call), while
allowing HTTP responses (even 5xx where status != 0) to be counted toward the
deadlock check and preserve the existing klog message for the skipped case.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 7ca5fc05-cf7b-4e48-970f-e7400f1307ef
📒 Files selected for processing (1)
pkg/controllers/kube-apiserver.go
```go
// rbacHookDeadlockTimeout is the time to wait for the RBAC bootstrap hook
// before declaring a deadlock. This is shorter than kubeAPIStartupTimeout
// to allow for faster recovery.
rbacHookDeadlockTimeout = 15
// rbacHookCheckInterval is how often to check the RBAC hook status
rbacHookPollDelayStart = 5 * time.Second
rbacHookCheckInterval  = 2
// rbacHookMaxWaitDuration is the absolute maximum time to wait for the RBAC hook
// regardless of etcd health state changes. This prevents flapping from extending
// detection indefinitely.
rbacHookMaxWaitDuration = 30 * time.Second
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Expected current result: negative maxChecks. Expected fixed result: 5.
python - <<'PY'
second = 1_000_000_000
n = 15 - 5 * second
go_trunc_div = -(abs(n) // 2)
print("current maxChecks:", go_trunc_div)
print("fixed maxChecks:", (15 * second - 5 * second) // (2 * second))
PY
rg -n 'rbacHookDeadlockTimeout|rbacHookPollDelayStart|rbacHookCheckInterval|maxChecks' pkg/controllers/kube-apiserver.goRepository: openshift/microshift
Length of output: 1035
Fix type mismatch in RBAC timing constants.
Lines 62 and 65 define integer values but are used in Duration arithmetic at line 469, producing maxChecks = -2,499,999,992. This causes the deadlock check to trigger immediately, breaking RBAC bootstrap detection.
Add * time.Second to lines 62 and 65:
rbacHookDeadlockTimeout = 15 * time.Second
rbacHookCheckInterval = 2 * time.Second
Then update line 446 to use %s format instead of %d seconds, and remove the * time.Second conversion from line 493 since rbacHookCheckInterval becomes a Duration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controllers/kube-apiserver.go` around lines 59 - 69, rbac timing
constants rbacHookDeadlockTimeout and rbacHookCheckInterval are plain integers
but used as time.Duration; change them to durations by multiplying by
time.Second (rbacHookDeadlockTimeout = 15 * time.Second and
rbacHookCheckInterval = 2 * time.Second), then update the log/format that prints
the timeout to use %s (duration) instead of “%d seconds”, and remove any places
that multiply rbacHookCheckInterval by time.Second (e.g. the spot that computes
maxChecks / or multiplies the interval) so all arithmetic uses time.Duration
consistently.
```go
// Check RBAC hook status
probeCtx, cancel := context.WithTimeout(ctx, time.Second)
var status int
err := restClient.Get().
	AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").
	Do(probeCtx).
	StatusCode(&status).
	Error()
cancel()

// If hook is ready, no deadlock
if err == nil && status == 200 {
	klog.V(4).Info("RBAC bootstrap hook completed successfully")
	return
}

// If RBAC probe errored, skip this iteration (don't count toward deadlock)
if err != nil {
	klog.V(4).Infof("RBAC probe error (not counting toward deadlock): %v", err)
	continue
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n 'k8s.io/client-go' go.mod
modcache="$(go env GOPATH 2>/dev/null)/pkg/mod"
rg -n -C4 'func \(r Result\) (Error|StatusCode)\(' "$modcache"/k8s.io/client-go@* 2>/dev/null || true

Repository: openshift/microshift
Length of output: 212
🏁 Script executed:
cat -n pkg/controllers/kube-apiserver.go | sed -n '490,520p'

Repository: openshift/microshift
Length of output: 1110
🏁 Script executed:
find . -path ./deps -prune -o -type f -name "*.go" -print | xargs grep -l "type Result" 2>/dev/null | head -5

Repository: openshift/microshift
Length of output: 351
🏁 Script executed:
# Look for the local client-go implementation
find ./deps -type f -name "*.go" -path "*client-go*" | grep -i result | head -5

Repository: openshift/microshift
Length of output: 698
🏁 Script executed:
# Search for Result type and Error method definition in the local client-go
rg -A10 "type Result struct" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/ 2>/dev/null | head -30

Repository: openshift/microshift
Length of output: 1171
🏁 Script executed:
rg -A5 "func \(r \*Result\) Error\(\)" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift
Length of output: 46
🏁 Script executed:
rg -A5 "func \(r \*Result\) StatusCode\(" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift
Length of output: 46
🏁 Script executed:
# Look at the Do() method to understand what happens
rg -B5 -A15 "func \(r \*Request\) Do\(ctx" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go | head -50

Repository: openshift/microshift
Length of output: 742
🏁 Script executed:
rg -A20 "func \(r \*Request\) transformResponse" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go | head -60

Repository: openshift/microshift
Length of output: 1126
🏁 Script executed:
# Search for StatusCode and Error methods more broadly
grep -n "func.*Result.*StatusCode\|func.*Result.*Error" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift
Length of output: 158
🏁 Script executed:
# Get a broader view of Result methods
rg -n "^func \(r.*Result\)" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift
Length of output: 486
🏁 Script executed:
sed -n '1433,1448p' ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift
Length of output: 664
🏁 Script executed:
sed -n '1489,1515p' ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift
Length of output: 1043
🏁 Script executed:
# Also check transformResponse to see if statusCode is populated for error responses
rg -A50 "func \(r \*Request\) transformResponse" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go | grep -A50 "resp.StatusCode"

Repository: openshift/microshift
Length of output: 46
🏁 Script executed:
# Let's see more of transformResponse to understand when err is set
sed -n '1269,1350p' ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go

Repository: openshift/microshift
Length of output: 3859
🏁 Script executed:
# Check where statusCode is assigned in transformResponse
rg -n "statusCode.*=" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go | head -10

Repository: openshift/microshift
Length of output: 151
🏁 Script executed:
# Search for where statusCode is assigned in transformResponse
rg -B5 -A5 "result.statusCode\|statusCode:" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go | head -80

Repository: openshift/microshift
Length of output: 46
🏁 Script executed:
# Get the full transformResponse implementation
rg -A100 "^func \(r \*Request\) transformResponse\(ctx context.Context" ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go | head -150

Repository: openshift/microshift
Length of output: 3722
HTTP error responses should be distinguished from transport errors.
When the server returns HTTP 5xx status codes, transformResponse() sets both statusCode and err fields in the Result. The current condition at line 513 treats this same as network/timeout errors and skips the iteration. This prevents the deadlock detector from counting the "hook not ready" state.
Modify the condition to skip only when there's no HTTP status code (transport/timeout error):
Proposed fix
```diff
-    err := restClient.Get().
+    result := restClient.Get().
         AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").
         Do(probeCtx).
-        StatusCode(&status).
-        Error()
+        StatusCode(&status)
+    err := result.Error()
     cancel()

     // If hook is ready, no deadlock
     if err == nil && status == 200 {
         klog.V(4).Info("RBAC bootstrap hook completed successfully")
         return
     }

-    // If RBAC probe errored, skip this iteration (don't count toward deadlock)
-    if err != nil {
+    // If no HTTP status was received, skip this iteration (transport/timeout).
+    if err != nil && status == 0 {
         klog.V(4).Infof("RBAC probe error (not counting toward deadlock): %v", err)
         continue
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controllers/kube-apiserver.go` around lines 496 - 515, The current RBAC
probe logic treats any non-nil err from restClient.Get() as a transport error
and skips deadlock counting; change the condition so only requests with no HTTP
status code (transport/timeout errors) are skipped — i.e., update the check
around restClient.Get()/probeCtx to skip/continue only when err != nil AND
status == 0 (use the existing status integer and err from the
Do(...).StatusCode(&status).Error() call), while allowing HTTP responses (even
5xx where status != 0) to be counted toward the deadlock check and preserve the
existing klog message for the skipped case.
@copejon: The following tests failed, say

Full PR test history. Your PR dashboard. Details. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Summary
rbacHookMaxWaitDuration = 30s) to prevent etcd health flapping from extending detection indefinitely

Problem
MicroShift enters a permanent crash loop after first restart due to the RBAC bootstrap hook deadlocking when etcd contains existing data. The `rbac/bootstrap-roles` PostStartHook uses `context.TODO()` for API calls with no timeout, causing a circular dependency where the hook waits for API server readiness while the API server waits for the hook to complete.

Related upstream issues: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119
Solution
Add a fail-fast deadlock detector that:
- Monitors `/readyz/poststarthook/rbac/bootstrap-roles` specifically
- Restarts `microshift-etcd.scope` to force clean recovery

Test plan
- `make verify-go` passes
- `make test-unit` passes

🤖 Generated with Claude Code via

/jira:solve USHIFT-6401, [USHIFT-6788](https://redhat.atlassian.net/browse/USHIFT-6788)

Summary by CodeRabbit